Multistability of Self-Attention Dynamics in Transformers
In machine learning, a self-attention dynamics is a continuous-time, multiagent-like model of the attention mechanism of transformers. In this paper we show that such dynamics is related to a multiagent version of the Oja flow, a dynamical system that computes the principal eigenvector of a matrix, which for transformers is the value matrix. We classify the equilibria of the "single-head" self-attention system into four classes: consensus, bipartite consensus, clustering, and polygonal equilibria. Multiple asymptotically stable equilibria from the first three classes often coexist in the self-attention dynamics. Interestingly, equilibria from the first two classes are always aligned with the eigenvectors of the value matrix, often but not exclusively with the principal eigenvector.
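As a concrete illustration of the Oja flow mentioned in the abstract, here is a minimal numerical sketch (not from the paper; the matrix V, step size, and iteration count are illustrative assumptions). It integrates the flow ẋ = Vx − (xᵀVx)x with Euler steps and checks that the state aligns with the principal eigenvector of V, the role played by the value matrix in the transformer setting:

```python
import numpy as np

def oja_flow_step(x, V, dt=0.01):
    """One Euler step of the Oja flow dx/dt = V x - (x^T V x) x.

    On the unit sphere this flow is the gradient flow of the Rayleigh
    quotient, so it generically converges to the principal eigenvector of V.
    """
    return x + dt * (V @ x - (x @ V @ x) * x)

rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d))
V = (A + A.T) / 2                 # symmetric stand-in for the value matrix
x = rng.standard_normal(d)
x /= np.linalg.norm(x)            # start on the unit sphere (invariant set)

for _ in range(20_000):
    x = oja_flow_step(x, V)

# Compare against the eigenvector of the largest eigenvalue of V.
w, U = np.linalg.eigh(V)
top = U[:, np.argmax(w)]
print(abs(x @ top))               # close to 1: aligned with the principal eigenvector
```

The multiagent self-attention dynamics studied in the paper couples many such states through attention weights; this single-agent sketch only shows the eigenvector-seeking behavior of the underlying flow.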
Dual Perspectives on Non-Contrastive Self-Supervised Learning
Jean Ponce, Basile Terver, Martial Hebert, Michael Arbel
The stop gradient and exponential moving average iterative procedures are commonly used in non-contrastive approaches to self-supervised learning to avoid representation collapse, with excellent performance in downstream applications in practice. This presentation investigates these procedures from the dual viewpoints of optimization and dynamical systems. We show that, in general, although they do not optimize the original objective, or any other smooth function, they do avoid collapse. Following Tian et al. (2021), but without any of the extra assumptions used in their proofs, we then show, using a dynamical-systems perspective, that in the linear case minimizing the original objective function without the use of a stop gradient or exponential moving average always leads to collapse. Conversely, we characterize explicitly the equilibria of the dynamical systems associated with these two procedures in this linear setting as algebraic varieties in their parameter space, and show that they are, in general, asymptotically stable. Our theoretical findings are illustrated by empirical experiments with real and synthetic data.

Self-supervised learning (SSL) is an approach to representation learning that exploits the internal consistency of training data without requiring expensive annotations. However, non-contrastive approaches to SSL (Assran et al., 2023; Bardes et al., 2022), which take as input different views of the same data samples and learn to predict one view from the other, are susceptible to representational collapse, where a constant embedding is learned for all data points (LeCun, 2022). In this presentation we use the dual viewpoints of optimization and dynamical systems to study, theoretically and empirically, the well-known stop gradient (Chen and He, 2021) and exponential moving average (Grill et al., 2020) training procedures that are specifically designed to avoid this problem.

[Figure: C is the global minimum of E(θ, ψ) (shown as negative instead of zero for readability), associated with a collapse of the training process; B is a nontrivial local minimum that may be reached using an appropriate regularization to avoid collapse; A is a limit point of the stop gradient (SG) training procedure with parameters θ and ψ at convergence. In general, A is not a minimum of E and thus does not correspond to a collapse of the training process, but it is a minimum of E(θ, ψ) with respect to ψ.]
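To make the two procedures concrete, here is a minimal linear toy sketch (all names, shapes, and hyperparameters are illustrative assumptions, not the paper's setup). An online encoder We with predictor Wp is trained to match a target branch on which gradients are stopped; the target encoder Wt is updated by an exponential moving average, and setting tau = 0 recovers a pure stop-gradient (SimSiam-style) variant:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n = 8, 4, 256
X = rng.standard_normal((n, d))          # data; two "views" made by adding noise

We = 0.1 * rng.standard_normal((k, d))   # online encoder (theta)
Wp = 0.1 * rng.standard_normal((k, k))   # predictor (psi)
Wt = We.copy()                           # EMA target encoder
lr, tau = 0.05, 0.99

for step in range(2_000):
    x1 = X + 0.1 * rng.standard_normal((n, d))
    x2 = X + 0.1 * rng.standard_normal((n, d))
    z1 = x1 @ We.T                       # online embedding
    p1 = z1 @ Wp.T                       # prediction of the target view
    z2 = x2 @ Wt.T                       # target embedding: no gradient flows here
    err = (p1 - z2) / n                  # gradient of 0.5 * ||p1 - z2||^2 / n w.r.t. p1
    # Backpropagate through the online branch only (stop gradient on z2).
    gWp = err.T @ z1
    gWe = (err @ Wp).T @ x1
    Wp -= lr * gWp
    We -= lr * gWe
    Wt = tau * Wt + (1 - tau) * We       # exponential moving average update

# Collapse would show up as a (near-)zero-rank embedding covariance.
Z = X @ We.T
print(np.linalg.matrix_rank(np.cov(Z.T), tol=1e-3))
```

Whether a given run avoids collapse depends on the hyperparameters; the paper's dynamical-systems analysis characterizes the equilibria of exactly these kinds of update rules in the linear setting.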